Efficiently Answering Top-k Typicality Queries on Large Databases

Authors

Ming Hua

Jian Pei

Ada Wai-Chee Fu

Xuemin Lin

Ho-fung Leung

Abstract

Finding typical instances is an effective approach to understanding and analyzing large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognition science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. To answer questions like "Who are the top-k most typical NBA players?", the measure of simple typicality is developed. To answer questions like "Who are the top-k most typical guards distinguishing guards from other players?", the notion of discriminative typicality is proposed. Computing the exact answer to a top-k typicality query requires quadratic time, which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations. (1) The randomized tournament algorithm has linear complexity, though it does not provide a theoretical guarantee on the quality of the answers. (2) The direct local typicality approximation using VP-trees provides an approximation quality guarantee. (3) A VP-tree can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree data structure. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.

∗ The research of Ming Hua and Jian Pei is supported in part by an NSERC Discovery Grant. The research of Ada Wai-Chee Fu is supported in part by the RGC Earmarked Research Grant of HKSAR CUHK 4120/05E. The research of Xuemin Lin is supported in part by the Australian Research Council Discovery Grant DP0666428 and the UNSW Faculty Research Grant Program. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. We thank the anonymous reviewers for their constructive comments, particularly for pointing out some important related work.

VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
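
To make the cost trade-off in the abstract concrete, here is a minimal Python sketch. The abstract does not define simple typicality, so the sketch assumes a Gaussian-kernel density estimate as the score; the bandwidth `h`, the `group_size` parameter, and all function names are hypothetical choices for illustration. The tournament shown is a simplified single run that returns only a top-1 candidate, so it should not be read as the authors' exact algorithm.

```python
import math
import random


def simple_typicality(points, h=1.0):
    """Score each point by an (assumed) Gaussian-kernel density estimate.

    Every point is compared against every other point, so the cost is
    quadratic in the number of points.
    """
    n = len(points)
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            d2 = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-d2 / (2 * h * h))
        scores.append(total / n)
    return scores


def exact_top_k(points, k, h=1.0):
    """Return the indices of the k highest-scoring points (exact, O(n^2))."""
    scores = simple_typicality(points, h)
    order = sorted(range(len(points)), key=lambda i: -scores[i])
    return order[:k]


def tournament_top_1(points, group_size=4, h=1.0, rng=None):
    """Approximate the most typical point with one randomized tournament.

    Candidates are shuffled and split into small groups; the locally most
    typical member of each group advances to the next round. No quality
    guarantee is implied.
    """
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    rng = rng or random.Random(0)
    candidates = list(range(len(points)))
    while len(candidates) > 1:
        rng.shuffle(candidates)
        winners = []
        for start in range(0, len(candidates), group_size):
            group = candidates[start:start + group_size]
            local = simple_typicality([points[i] for i in group], h)
            best = max(range(len(group)), key=lambda j: local[j])
            winners.append(group[best])
        candidates = winners
    return candidates[0]
```

With a fixed group size, each round scores only small groups, so the work per round is linear in the number of remaining candidates, whereas `exact_top_k` compares all pairs. Repeating the tournament with different random seeds and keeping the most frequent winner is one plausible way to reduce variance.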

Similar resources

Top-k best probability queries and semantics ranking properties on probabilistic databases

There has been much interest in answering top-k queries on probabilistic data in various applications such as market analysis, personalised services, and decision making. In probabilistic relational databases, the most common problem in answering top-k queries (ranking queries) is selecting the top-k result based on scores and top-k probabilities. In this paper, we firstly propose novel answers...

Scalable Continual Top-k Keyword Search in Relational Databases

Keyword search in relational databases has been widely studied in recent years because it requires users neither to master a structured query language nor to know the complex underlying database schemas. Most existing methods focus on answering snapshot keyword queries in static databases. In practice, however, databases are updated frequently, and users may have long-term in...

Best Position Algorithms for Top-k Queries

The general problem of answering top-k queries can be modeled using lists of data items sorted by their local scores. The most efficient algorithm proposed so far for answering top-k queries over sorted lists is the Threshold Algorithm (TA). However, TA may still incur a lot of useless accesses to the lists. In this paper, we propose two new algorithms which stop much sooner. First, we propose ...

Answering Top K Queries Efficiently with Overlap in Sources and Source Paths

Challenges in answering queries over Web-accessible sources are selecting the sources that must be accessed and computing answers efficiently. Both tasks become more difficult when there is overlap among sources and when sources may return answers of varying quality. The objective is to obtain the best answers while minimizing the costs or delay in computing these answers and is similar to solv...
